# Zero-shot Image Classification

**FG-CLIP Base** (qihoo360, Apache-2.0)
FG-CLIP is a fine-grained vision-language alignment model that achieves both global and region-level image-text alignment through two-stage training.
Text-to-Image · Transformers · English · 692 downloads · 2 likes

**OpenVision ViT Base Patch16 224** (UCSC-VLAA, Apache-2.0)
OpenVision is a fully open, cost-effective family of advanced vision encoders for multimodal learning.
Multimodal Fusion · 79 downloads · 0 likes

**OpenVision ViT Large Patch14 224** (UCSC-VLAA, Apache-2.0)
OpenVision is a fully open, cost-effective family of advanced vision encoders for multimodal learning.
Multimodal Fusion · 308 downloads · 4 likes

**ViT-gopt-16-SigLIP2-256** (timm, Apache-2.0)
A SigLIP 2 vision-language model trained on the WebLI dataset, suitable for zero-shot image classification.
Text-to-Image · 43.20k downloads · 0 likes

**ViT-SO400M-14-SigLIP2** (timm, Apache-2.0)
A SigLIP 2 vision-language model trained on the WebLI dataset, suitable for zero-shot image classification.
Text-to-Image · 1,178 downloads · 0 likes

**ViT-L-16-SigLIP2-384** (timm, Apache-2.0)
A SigLIP 2 vision-language model trained on the WebLI dataset, suitable for zero-shot image classification.
Text-to-Image · 581 downloads · 0 likes

**ViT-B-16-SigLIP2** (timm, Apache-2.0)
A SigLIP 2 vision-language model trained on the WebLI dataset, suitable for zero-shot image classification.
Text-to-Image · 11.26k downloads · 0 likes
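
The timm SigLIP 2 checkpoints above are typically loaded through open_clip for zero-shot classification. The snippet below is a minimal sketch that assumes the ViT-B-16-SigLIP2 weights are published under the hub id `timm/ViT-B-16-SigLIP2` and load via open_clip's `hf-hub:` support; swap in whichever variant you need.

```python
import torch
import open_clip
from PIL import Image

# Assumed hub id; substitute the SigLIP 2 variant you actually want.
MODEL_ID = "hf-hub:timm/ViT-B-16-SigLIP2"

model, preprocess = open_clip.create_model_from_pretrained(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # SigLIP models are trained with a sigmoid loss, but a softmax over cosine
    # similarities is still a convenient way to rank candidate labels.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```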

**siglip2-so400m-patch16-naflex** (google, Apache-2.0)
SigLIP 2 improves on the SigLIP pre-training objective, combining several techniques to strengthen semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 159.81k downloads · 21 likes

**siglip2-base-patch16-naflex** (google, Apache-2.0)
SigLIP 2 is a multilingual vision-language encoder that builds on SigLIP's pretraining objective with new training schemes, improving semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 10.68k downloads · 5 likes

**siglip2-so400m-patch16-512** (google, Apache-2.0)
SigLIP 2 is a SigLIP-based vision-language model with improved semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 46.46k downloads · 18 likes

**siglip2-so400m-patch16-384** (google, Apache-2.0)
SigLIP 2 improves on the SigLIP pre-training objective, combining several techniques to strengthen semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 7,632 downloads · 2 likes

**siglip2-so400m-patch16-256** (google, Apache-2.0)
SigLIP 2 is an improved version of SigLIP that combines several techniques to strengthen semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 2,729 downloads · 0 likes

**siglip2-giant-opt-patch16-384** (google, Apache-2.0)
SigLIP 2 improves on the SigLIP pretraining objective, combining several techniques to strengthen semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 26.12k downloads · 14 likes

**siglip2-giant-opt-patch16-256** (google, Apache-2.0)
SigLIP 2 is an advanced vision-language model that combines several techniques to strengthen semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 3,936 downloads · 1 like

**siglip2-large-patch16-384** (google, Apache-2.0)
SigLIP 2 is an improved multilingual vision-language encoder based on SigLIP, with stronger semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 6,525 downloads · 2 likes

**siglip2-large-patch16-256** (google, Apache-2.0)
SigLIP 2 is an improved SigLIP-based vision-language model that combines several techniques to strengthen semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 10.89k downloads · 3 likes

**siglip2-base-patch16-512** (google, Apache-2.0)
SigLIP 2 is a vision-language model that combines several techniques to strengthen semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 28.01k downloads · 10 likes

**siglip2-base-patch16-384** (google, Apache-2.0)
SigLIP 2 is a SigLIP-based vision-language model that improves semantic understanding, localization, and dense feature extraction through a unified training recipe.
Image-to-Text · Transformers · 4,832 downloads · 5 likes

**siglip2-base-patch16-256** (google, Apache-2.0)
SigLIP 2 is a multilingual vision-language encoder with improved semantic understanding, localization, and dense feature extraction.
Image-to-Text · Transformers · 45.24k downloads · 4 likes

**siglip2-base-patch16-224** (google, Apache-2.0)
SigLIP 2 is an improved multilingual vision-language encoder based on SigLIP, with stronger semantic understanding, localization, and dense feature extraction.
Image-to-Text · Transformers · 44.75k downloads · 38 likes

**siglip2-base-patch32-256** (google, Apache-2.0)
SigLIP 2 is an improved version of SigLIP that combines several techniques to strengthen semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 9,419 downloads · 4 likes
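
For the Google SigLIP 2 checkpoints listed above, the quickest route is the Transformers zero-shot image classification pipeline. A minimal sketch, assuming the base fixed-resolution checkpoint is available as `google/siglip2-base-patch16-224` and that the installed transformers version includes SigLIP 2 support (the NaFlex variants use their own variable-resolution processing, so check their model cards):

```python
from transformers import pipeline

# Assumed repo id for the base 224px checkpoint; the other fixed-resolution
# SigLIP 2 variants in this list should work the same way.
classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip2-base-patch16-224",
)

predictions = classifier(
    "example.jpg",  # local path, URL, or PIL.Image
    candidate_labels=["a cat", "a dog", "a car"],
)
print(predictions)  # list of {"label": ..., "score": ...}, highest score first
```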

**mmE5-mllama-11b-instruct** (intfloat, MIT)
mmE5 is a multimodal, multilingual embedding model built on Llama-3.2-11B-Vision; it improves embedding quality with high-quality synthetic training data and achieves state-of-the-art results on the MMEB benchmark.
Multimodal Fusion · Transformers · multilingual · 596 downloads · 18 likes
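
mmE5 is an embedding model rather than a classifier, so it is normally queried for pooled multimodal embeddings. The code below is only a generic sketch of last-token pooling over the hidden states of a Llama-3.2-Vision-style backbone; the repo id `intfloat/mmE5-mllama-11b-instruct`, the use of `MllamaForConditionalGeneration`, and the pooling choice are assumptions here, and the model card's exact prompt template and pooling take precedence.

```python
import torch
import torch.nn.functional as F
from transformers import AutoProcessor, MllamaForConditionalGeneration

MODEL_ID = "intfloat/mmE5-mllama-11b-instruct"  # assumed repo id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = MllamaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Text-only query; image inputs would go through the same processor call.
inputs = processor(
    text="Represent this sentence: a photo of a cat",
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True, return_dict=True)
    last_hidden = outputs.hidden_states[-1]              # (batch, seq_len, hidden_dim)
    embedding = F.normalize(last_hidden[:, -1], dim=-1)  # last-token pooling (generic choice)

print(embedding.shape)
```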

**GenMedClip B 16 PMB** (wisdomik, MIT)
A zero-shot image classification model built on the open_clip library, specialized for medical image analysis.
Image Classification · 408 downloads · 0 likes

**GenMedClip** (wisdomik, MIT)
GenMedClip is a zero-shot image classification model built on the open_clip library, specialized for medical image analysis.
Image Classification · 40 downloads · 0 likes
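
Both GenMedClip entries are built on open_clip, so zero-shot medical image classification follows the usual open_clip recipe. The hub id below (`wisdomik/GenMedClip`) is a hypothetical placeholder; check the model card for the exact id and tokenizer name.

```python
import torch
import torch.nn.functional as F
import open_clip
from PIL import Image

HUB_ID = "hf-hub:wisdomik/GenMedClip"  # hypothetical id; verify on the model card

model, _, preprocess = open_clip.create_model_and_transforms(HUB_ID)
tokenizer = open_clip.get_tokenizer(HUB_ID)
model.eval()

image = preprocess(Image.open("chest_xray.png").convert("RGB")).unsqueeze(0)
labels = ["a chest X-ray showing pneumonia", "a normal chest X-ray"]
text = tokenizer(labels)

with torch.no_grad():
    img_feat = F.normalize(model.encode_image(image), dim=-1)
    txt_feat = F.normalize(model.encode_text(text), dim=-1)
    scores = (img_feat @ txt_feat.T).squeeze(0)  # cosine similarity per label

print(dict(zip(labels, scores.tolist())))
```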

**CLIP ViT-L-14 Spectrum Icons 20k** (JianLiao, MIT)
A vision-language model fine-tuned from CLIP ViT-L/14, optimized for abstract image-text retrieval.
Text-to-Image · TensorBoard · English · 1,576 downloads · 1 like
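
Because this checkpoint is a fine-tune of CLIP ViT-L/14 aimed at image-text retrieval, it can presumably be scored with the standard Transformers CLIP classes. A sketch under that assumption; the repo id `JianLiao/CLIP-ViT-L-14-spectrum-icons-20k` and the standard CLIP weight layout are assumptions here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "JianLiao/CLIP-ViT-L-14-spectrum-icons-20k"  # assumed repo id

model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)
model.eval()

images = [Image.open("icon_a.png"), Image.open("icon_b.png")]
queries = ["a minimalist lightning bolt icon", "a gear-shaped settings icon"]

inputs = processor(text=queries, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (num_queries, num_images); higher means a better match.
print(outputs.logits_per_text.softmax(dim=-1))
```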

**eva02_large_patch14_clip_336.merged2b** (timm, MIT)
EVA02 CLIP is a large-scale vision-language model built on the CLIP architecture, supporting tasks such as zero-shot image classification.
Text-to-Image · 197 downloads · 0 likes

**vit_so400m_patch16_siglip_256.webli_i18n** (timm, Apache-2.0)
A SigLIP-based Vision Transformer for image feature extraction, containing only the image encoder and using the original attention pooling.
Image Classification · Transformers · 15 downloads · 0 likes

**vit_so400m_patch14_siglip_gap_384.webli** (timm, Apache-2.0)
A SigLIP-based Vision Transformer image encoder that uses global average pooling for image features.
Image Classification · Transformers · 96 downloads · 0 likes

**vit_so400m_patch14_siglip_378.webli** (timm, Apache-2.0)
A SigLIP-based Vision Transformer containing only the image encoder and using the original attention pooling.
Image Classification · Transformers · 82 downloads · 0 likes

**vit_base_patch16_siglip_512.webli** (timm, Apache-2.0)
A SigLIP-based Vision Transformer containing only the image encoder and using the original attention pooling.
Image Classification · Transformers · 702 downloads · 0 likes

**vit_so400m_patch14_siglip_224.webli** (timm, Apache-2.0)
A SigLIP-based Vision Transformer containing only the image encoder and using the original attention pooling.
Image Classification · Transformers · 123 downloads · 1 like
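
The `.webli` entries above ship only the SigLIP image tower, so they are used through timm for image feature extraction rather than for zero-shot classification. A minimal sketch, using `vit_base_patch16_siglip_512.webli` as the example name; the other encoder-only timm checkpoints in this list follow the same pattern.

```python
import timm
import torch
from PIL import Image

# Encoder-only checkpoint; num_classes=0 returns pooled features instead of logits.
model = timm.create_model("vit_base_patch16_siglip_512.webli",
                          pretrained=True, num_classes=0)
model.eval()

# Build preprocessing that matches the checkpoint's pretraining configuration.
data_cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_cfg, is_training=False)

image = transform(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    features = model(image)  # shape: (1, embed_dim)

print(features.shape)
```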

**vit_large_patch14_clip_224.datacompxl** (timm, Apache-2.0)
A CLIP-architecture Vision Transformer for image feature extraction, released by the LAION organization.
Image Classification · Transformers · 14 downloads · 0 likes

**vit_giant_patch14_clip_224.laion2b** (timm, Apache-2.0)
A CLIP-architecture Vision Transformer for image feature extraction, trained on the LAION-2B dataset.
Image Classification · Transformers · 71 downloads · 0 likes

**convnext_base.clip_laion2b_augreg** (timm, Apache-2.0)
A ConvNeXt-Base image encoder from the CLIP framework, trained on the LAION-2B dataset and suitable for image feature extraction.
Image Classification · Transformers · 522 downloads · 0 likes

**convnext_base.clip_laion2b** (timm, Apache-2.0)
A ConvNeXt-based CLIP image encoder trained by LAION, suitable for multimodal vision-language tasks.
Image Classification · Transformers · 297 downloads · 0 likes

**CLIP-SAE-ViT-L-14** (zer0int, MIT)
A CLIP model fine-tuned with a sparse autoencoder (SAE), excelling at zero-shot image classification and particularly adept at handling adversarial typographic attacks.
Text-to-Image · Transformers · 32 downloads · 29 likes

**vit_large_patch14_clip_224.laion400m_e31** (timm, MIT)
A large Vision Transformer trained on the LAION-400M dataset, supporting zero-shot image classification.
Image Classification · 21.15k downloads · 0 likes

**vit_base_patch16_plus_clip_240.laion400m_e31** (timm, MIT)
A vision-language model trained on the LAION-400M dataset, supporting zero-shot image classification.
Image Classification · 37.23k downloads · 0 likes

**vit_base_patch16_clip_224.metaclip_400m** (timm)
A vision model trained on the MetaCLIP-400M dataset, compatible with both the OpenCLIP and timm frameworks.
Image Classification · 1,206 downloads · 1 like

**vit_base_patch16_clip_224.laion400m_e32** (timm, MIT)
A Vision Transformer trained on the LAION-400M dataset, compatible with both the open_clip and timm frameworks.
Image Classification · 5,751 downloads · 0 likes